Introduction

Having a good education is very important for students and their parents. There are many factors that need to think completely. From the huge number of factors that need to be considered, the graduation rate is always one of the most important indicators for their choices. According to INFORMnny, graduation rates in New York State have grown steadily since 2005 and reached 84.8% of students earned diplomas in 2020. In this report, we are trying to visualize the graduation rate of each counties and races in New York State. Our goal is to find how factors such as different counties, students’ genders, and races influence the graduation rate. The result of our project may perform as a guide for up-coming college students to choose which counties are their best choice based on their goals. Moreover, it may provide more insights for educators to improve the quality of education.

Data Description

We will use two datasets. The first one is Graduation Rate Database which contains annual graduation rate, and dropout rate data for the state as well as by county, Need to Resource Capacity group, district, public school and charter school. These datasets are published on data.nysed.gov once a year, and we’ll be analyzing the data from 2014 to 2020. The second dataset is the latitude and longitude of the county boundaries in New York State stored in GeoJSON format.

We preprocessed the data before drawing plots. Here’s the link for raw data: graduation rate, county boundaries. Here’s the link for processed data: processed data.

Plot1: Graduation Rate by Race over time

The plot below is the time series of graduation rates by different races, cohorts, and counties. There are two drop-down menus at the bottom of the graph. You can select the different names of counties and cohorts to see different sets of lines. Another interactive tool is tooltip. You can hangover your mouse on the line to show the exact graduation rate of that group. Different colors highlight the difference between races.

# load the data
grad<- read.csv("https://uwmadison.box.com/shared/static/n0paw3xy9sdanw7c06u0tcxvyqh7gwew")
grad$subgroup_code<- as.factor(grad$subgroup_code)
grad$membership_code<- as.factor(grad$membership_code)
# filter out different races
grad1<- grad %>%
  filter(subgroup_code == "1"|subgroup_code== "5"|subgroup_code== "6"|subgroup_code== "7"|subgroup_code== "8", grad_pct > 10) %>%  
  as_tsibble(index = year, key = c("subgroup_code","membership_code", "county_code"))
# convert to a dd/mm/yy format
grad2<- grad1
grad2$year[grad2$year == "2020"]<- "1/1/2020"
grad2$year[grad2$year == "2019"]<- "1/1/2019"
grad2$year[grad2$year == "2018"]<- "1/1/2018"
grad2$year[grad2$year == "2017"]<- "1/1/2017"
grad2$year[grad2$year == "2016"]<- "1/1/2016"
grad2$year[grad2$year == "2015"]<- "1/1/2015"

#write.csv(grad2, "grad2.csv")

For membership code, it refers to the cohort and the school year being reported. 6 - 6-year outcome, June; 8 - 5-year outcome, June ; 9 - 4-year outcome, June; 10 - 5-year outcome, August; 11 - 4-year outcome, August; 18 - 6-year outcome, August.

Graduation Rate for each Cohort and County

robservable("https://observablehq.com/@zijianzhang/m3", include = 7, height = 400, width = 900)




Maybe it’s hard to summarize the trend of graduation rate on county level, we can see it on state level instead. (Since 5-year outcome, August and 6-year outcome, August only have two years of data, we omit them.)

grad3<- grad1 %>%
  filter(membership_code == "6"|membership_code == "8"|membership_code == "9"|membership_code
         == "11") %>%
  as_tibble() %>%
  group_by(membership_code, subgroup_name, year) %>%
  summarise(grad_pct_mean = mean(grad_pct))
grad3$membership_code<- factor(grad3$membership_code, levels = c(9, 11, 8, 6))

viz<- ggplot(grad3) +
  geom_line(aes(x = year, y = grad_pct_mean, color = subgroup_name), size = 0.8) +
  facet_wrap(~ membership_code, 
             labeller = as_labeller(c(`6` = "6-year outcome, June",
                                      `8` = "5-year outcome, June",
                                      `9` = "4-year outcome, June",
                                      `11` = "4-year outcome, August"))) +
  scale_color_brewer(palette = "Set1") +
  theme_bw() +
  labs(y = "Average Graduation Rate (%)") +
  ggtitle("Average Graduation Rate by Cohorts (NY State level)") +
  theme(panel.grid.minor = element_blank(),
        legend.position = "bottom")

gp<- ggplotly(viz)

layout_ggplotly <- function(gg, x = -0.02, y = -0.08){
  # The 1 and 2 goes into the list that contains the options for the x and y axis labels respectively
  gg[['x']][['layout']][['annotations']][[1]][['y']] <- x
  gg[['x']][['layout']][['annotations']][[2]][['x']] <- y
  gg
}
gp %>% layout_ggplotly(x = -0.05, y = -0.05)

Plot2: Choropleth Maps

The Choropleth Maps below show the percentage of average values and average standard deviations of graduation rate of each county in New York State respectively. Specifically, the x-axis is the latitude and the y-axis is the longitude, which form maps that can display the coordinates of each county. The colors on the maps represent the quantitative data of the percentage of graduation rate mean and graduation rate standard deviation, the higher value of quantitative data , the darker the color on the map. We can see from the graphs that the percentage of average graduation rate of each county in New York State is different, which may be due to the education level of each county, the strength of the teachers and the students’ own race.

county_mean <- read.csv("https://uwmadison.box.com/shared/static/o1ry3o0jtmxzm8vbbnajitkts8dcnbor.csv") 
names(county_mean)[2] <- "name"
boundary <- read_sf("https://uwmadison.box.com/shared/static/kwdzpyq0k6az4gd9sp05r61u8pn67t7r.geojson") %>%
  filter(state_name == "New York") 
name <- toupper(boundary$name)
boundary$name<- name
county_mean$name<- as.character(county_mean$name)
county_mean[45, 2]<- "ST. LAWRENCE"

df <- left_join(boundary, county_mean, by = "name") 

county_sd <- grad %>%
  group_by(county_name) %>%
  summarise(pct_sd = sd(grad_pct))
names(county_sd)[1] <- "name"
county_sd$name<- as.character(county_sd$name)
county_sd[45, 1]<- "ST. LAWRENCE"

df2<- left_join(df, county_sd, by = "name")

p1 <- ggplot() +
  geom_sf(data = df, aes(fill = pct_mean,text=paste('</br>County: ', str_to_title(name),'</br>Average Graduation Rate: ',round(pct_mean,3))), inherit.aes = FALSE) +
  coord_sf(crs = st_crs(3857)) +
  labs(x = "Longitude", y = "Latitude", fill = "Mean of Graduation Rate") +
  ggtitle("Average Graduation Rate for each County") +
  theme_bw() +
  theme(legend.position = "bottom") +
  scale_fill_distiller(palette = "YlOrRd", direction = 1)
ggplotly(,tooltip = "text")
p2 <- ggplot() +
  geom_sf(data = df2, aes(fill = pct_sd, text=paste('</br>County: ', str_to_title(name),'</br>Standard Deviation of Graduation Rate: ',round(pct_sd,3))),inherit.aes = FALSE) +
  coord_sf(crs = st_crs(3857)) +
  #geom_sf_text(data = df, aes(label = str_to_title(name)), check_overlap = T, size = 1.65, col = "blue") +
  labs(x = "Longitude", y = "Latitude", fill = "SD of Graduation Rate") +
  ggtitle("Standard Deviation of Graduation rate for each County") +
  theme_bw() +
  theme(legend.position = "bottom") +
  scale_fill_distiller(palette = "RdPu", direction = 1)
ggplotly(,tooltip = "text")

Plot3: K-means & PCA

set.seed(479)
data=read_csv('grad_1.csv')
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   X1 = col_double(),
##   aggregation_index = col_double(),
##   aggregation_code = col_double(),
##   county_code = col_double(),
##   membership_code = col_double(),
##   membership_key = col_double(),
##   subgroup_code = col_double(),
##   enroll_cnt = col_double(),
##   year = col_double()
## )
## See spec(...) for full column specifications.
## Warning: 4917 parsing failures.
##  row        col expected actual         file
## 8172 enroll_cnt a double      - 'grad_1.csv'
## 8177 enroll_cnt a double      - 'grad_1.csv'
## 8185 enroll_cnt a double      - 'grad_1.csv'
## 8186 enroll_cnt a double      - 'grad_1.csv'
## 8189 enroll_cnt a double      - 'grad_1.csv'
## .... .......... ........ ...... ............
## See problems(...) for more details.
f=function(a){
  return(a/100)
}
data2 = data%>%
  dplyr::select(matches('pct|subgroup_name|county_name'))%>%
  mutate_all(funs(str_replace(.,'%','')))%>%
  mutate(across(!c(subgroup_name,county_name),as.numeric))%>%
  drop_na()%>%
  mutate_at((3:10),f)
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
pca_rec <- recipe(~., data = data2) %>%
  update_role(subgroup_name,county_name, new_role = "id") %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors())
pca_prep <- prep(pca_rec)

data3=juice(pca_prep)[,1:4]

cluster <- function(data, K) {
  x <- data %>%
    select(PC1,PC2)
  
  kmeans(x, center = K) %>%
    augment(data) %>% # creates column ".cluster" with cluster label
    mutate(silhouette = silhouette(as.integer(.cluster), dist(x))[, "sil_width"])
}

result=cluster(data3,4)
data4=cbind(data2,result$.cluster)

colnames(data4)[11]='cluster'

We used grad_pct, local_pct, reg_pct, regadv_pct, non_diploma_credential_pct, still_enr_pct, ged_pct and dropout_pct to make the k-means plot. The meaning of each variable is below:

Firstly, we dropped the rows with NA and did PCA on these variables. Secondly, we used PC1 and PC2 to train the K-means model. We set K at 4 because we found that this value can separate the data best. Then we made plots among the subgroups, which can show the difference of cluster distribution between different subgroups. The table below shows the average values of the variables of each cluster.

library("readr")
library("tidytext")
components_ <- pca_prep %>%
 tidy(2) %>%
  filter(component %in% str_c("PC", 1:2)) %>%
  mutate(terms = reorder_within(terms, abs(value), component))
components_[,1]=rep(c('Graduation percentage','Local diploma percentage','Regent diploma percentage','Regent diploma with Advanced Designation percentage','Non-diploma percentage','Still enrolled percentage','Equivalency percentage','Drop out percentage'),2)
ggplot(components_, aes(value, terms)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ component, scales = "free_y") +
labs(y = NULL) +
  ggtitle("Components of Principle Component 1 and Principle Component 2") +
  theme_bw() +
  theme(legend.text=element_text(size=10),
        panel.grid.minor = element_blank(),
        legend.position = "bottom",
        plot.title = element_text(hjust = 0.5))

Interpretation: For principle component 1, graduation percentage contributes the most, but the difference between it and the others is not big. However, Regent diploma with Advanced Designation percentage and Regent diploma percentage contribute much more that the others in principle component 2.

t1=data4 %>%
  group_by(cluster) %>%
  summarise(mean(grad_pct))

t2=data4%>%
  group_by(cluster)%>%
  summarise(mean(local_pct))

t3=data4%>%
  group_by(cluster)%>%
  summarise(mean(reg_pct))

t4=data4%>%
  group_by(cluster)%>%
  summarise(mean(reg_adv_pct))

t5=data4%>%
  group_by(cluster)%>%
  summarise(mean(non_diploma_credential_pct))

t6=data4%>%
  group_by(cluster)%>%
  summarise(mean(still_enr_pct))

t7=data4%>%
  group_by(cluster)%>%
  summarise(mean(ged_pct))

t8=data4%>%
  group_by(cluster)%>%
  summarise(mean(dropout_pct))

colnames(t1)[2]='value'
colnames(t2)[2]='value'
colnames(t3)[2]='value'
colnames(t4)[2]='value'
colnames(t5)[2]='value'
colnames(t6)[2]='value'
colnames(t7)[2]='value'
colnames(t8)[2]='value'
variable=c(rep('grad_pct',4),rep('local_pct',4),rep('reg_pct',4),rep('reg_adv_pct',4),rep('non_diploma_credential_pct',4),rep('still_enr_pct',4),rep('ged_pct',4),rep('dropout_pct',4))

t=cbind(rbind(t1,t2,t3,t4,t5,t6,t7,t8),variable) %>%
  pivot_wider(id_cols = cluster, names_from = variable, values_from = value)

colnames(t) = c('Cluster','Graduation percentatge','Local diploma percentage','Regent diploma percentage','Regent diploma with Advanced Designation percentage','Non-diploma percentage','Still enrolled percentage','Equivalency percentage','Drop out percentage')
library(knitr)
kable(t)
Cluster Graduation percentatge Local diploma percentage Regent diploma percentage Regent diploma with Advanced Designation percentage Non-diploma percentage Still enrolled percentage Equivalency percentage Drop out percentage
1 0.4962500 0.2154416 0.2519124 0.0292590 0.0742328 0.1601984 0.0136639 0.2533945
2 0.9013773 0.0441703 0.3775606 0.4798053 0.0082267 0.0327387 0.0045398 0.0514700
3 0.8653535 0.0641427 0.5240293 0.2772371 0.0123757 0.0386491 0.0059942 0.0757503
4 0.7180578 0.1023912 0.4729104 0.1429690 0.0207496 0.0895510 0.0124029 0.1572458
result %>% dplyr::filter(grepl('All|American|Asian|Black|Hispanic|Multiracial|White', subgroup_name)) %>%
  ggplot(aes(col = .cluster)) +
  geom_point(aes(x=PC1,y=PC2),size=0.3) + 
  facet_wrap(.~subgroup_name)+
  scale_color_brewer(palette = "Set2") +
  labs(x = 'Principle Component 1',
       y = 'Principle Component 2',
       title = 'K-means Plot Group By Race',
       color = 'Cluster')+
  theme_bw() +
  theme(legend.text=element_text(size=10),
        panel.grid.minor = element_blank(),
        legend.position = "bottom",
        plot.title = element_text(hjust = 0.5))+
  guides(colour = guide_legend(override.aes = list(size=4)))

We can see the race and graduation rate do have correlation and colored race tends to have lower graduation rate because they have more points in cluster 2.

result %>% dplyr::filter(grepl('All|Econo', subgroup_name)) %>%
  ggplot(aes(col = .cluster)) +
  geom_point(aes(x=PC1,y=PC2),size=0.3) + 
  facet_wrap(.~subgroup_name)+
  scale_color_brewer(palette = "Set2") +
  labs(x = 'Principle Component 1',
       y = 'Principle Component 2',
       title = 'K-means Plot Group By Economic Situation',
       color = 'Cluster')+
  theme_bw() +
  theme(legend.text=element_text(size=10),
        panel.grid.minor = element_blank(),
        legend.position = "bottom",
        plot.title = element_text(hjust = 0.5))+
  guides(colour = guide_legend(override.aes = list(size=4)))

The economically disadvantaged students have less points in cluster 3 which has the best performance in all the indexes above.

There is a strong relationship between English language skill and the graduation rate.

Students in foster care have lower graduation rate than the others.

Students that are migrants still have lower graduation rate than the others.

Homeless students have lower graduation rate than the others.

We don’t see significant difference between the two groups.

Almost all the disabled students are put into cluster 1 and 2 which have the lowest graduation rate and the highest dropout rate.

Conclusion/Summary

For plot 1:

In general, we could find that New York State’s graduation rate has slowly increased between 2015 and 2020, which represents a steady increase in the level of education of New York State during recent years with the development of society as a whole.

  1. The time series of different counties varied greatly. Graduation rates are gradually increasing from 2014 to 2020.

  2. For all kinds of cohorts, it is a little bit surprised that Asian/Pacific Islander have the highest graduation rates among all groups; white people having slightly lower rates. Both are above average. Hispanics and African-Americans are below average, but it is hard to distinguish these two groups. This result confirms the fact that different races have unequal educational opportunities

  3. The shapes of the time series of different cohorts are roughly the same.

For plot 2:

  1. Looking specifically at Choropleth Maps, we could see that Westchester and Suffolk have the highest graduation rates among all counties, which is 80.69% and 80.56% respectively,but their average standard deviation is the lowest around all the counties.

  2. Generally, counties with higher graduation rates will have lower standard deviation of graduation rates. It shows that education inequalities are greatr in counties with low graduation rates than in counties with high graduation rates.

For plot 3:

  1. The plots can show the difference between the subgroups. Race, homelessness, economic situation, disability, language skill and other factors showed in the plots have significant correlation with the students performance at school. More researches are needed to explore how these factors influence the students and find ways to improve such situation.

  2. In terms of personal factors, we conclude that the graduation rate is related to factors such as race, homelessness, financial status, disability, language skills and other factors. Asian/Pacific Islander has a higher graduation rate than other races, while the graduation rate of blacks, Hispanics and African-Americans among all races is relatively low. This may mean that different races have different educational opportunities, or that New York State’s educational methods are more suitable for Asian/Pacific Islanders, so for them, New York State’s high schools could provide them with good educational opportunities. For students with poor English (probably international students), financial burden, migration, foster care, homelessness, and disabilities, the probabilities of graduating for them are relatively low. Most schools in counties of New York State may not have developed a plan for such diverse students to help them graduate better now.

Appendix: Data Curation

#Look at one dataset to know the structure and detect NAs
grad20<- read.csv("https://uwmadison.box.com/shared/static/rdyzsbt2pfwqzbhtq31ks5fw0644xc9c.csv")
skim(grad20)
## Warning in sorted_count(x): Variable contains value(s) of "" that have been
## converted to "empty".

## Warning in sorted_count(x): Variable contains value(s) of "" that have been
## converted to "empty".

## Warning in sorted_count(x): Variable contains value(s) of "" that have been
## converted to "empty".

## Warning in sorted_count(x): Variable contains value(s) of "" that have been
## converted to "empty".

## Warning in sorted_count(x): Variable contains value(s) of "" that have been
## converted to "empty".
Data summary
Name grad20
Number of rows 227450
Number of columns 37
_______________________
Column type frequency:
factor 26
numeric 11
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
report_school_year 0 1 FALSE 1 201: 227450
aggregation_type 0 1 FALSE 5 Sch: 144138, Dis: 74012, Cou: 8168, Nee: 988
aggregation_name 0 1 FALSE 2104 NEW: 328, BEA: 242, EAR: 236, PAT: 230
entity_inactive_date 0 1 FALSE 5 emp: 226774, 06/: 242, 06/: 232, 06/: 110
lea_name 0 1 FALSE 695 emp: 90094, NEW: 8396, NEW: 3516, NEW: 2686
nrc_desc 0 1 FALSE 8 Ave: 73658, NYC: 60742, Rur: 30180, Low: 28312
county_name 0 1 FALSE 63 KIN: 19740, BRO: 17350, NEW: 17090, SUF: 13854
boces_name 0 1 FALSE 39 NEW: 67024, NAS: 11866, emp: 9300, EAS: 9098
membership_desc 0 1 FALSE 6 201: 38211, 201: 38211, 201: 37824, 201: 37824
subgroup_name 0 1 FALSE 24 All: 12358, Not: 12358, Par: 12358, Not: 12352
grad_cnt 0 1 FALSE 3450 -: 60554, 5: 3210, 6: 2785, 7: 2613
grad_pct 0 1 FALSE 102 -: 60554, 100: 17343, 96%: 8350, 97%: 8251
local_cnt 0 1 FALSE 987 -: 60554, 0: 37547, 1: 21298, 2: 15694
local_pct 0 1 FALSE 96 -: 60554, 0%: 39517, 3%: 10752, 2%: 10582
reg_cnt 0 1 FALSE 2692 -: 60554, 4: 4202, 5: 4179, 3: 4129
reg_pct 0 1 FALSE 102 -: 60554, 50%: 5278, 40%: 4096, 45%: 3975
reg_adv_cnt 0 1 FALSE 1810 -: 60554, 0: 32435, 1: 8086, 2: 6385
reg_adv_pct 0 1 FALSE 102 -: 60554, 0%: 32593, 33%: 2934, 29%: 2736
non_diploma_credential_cnt 0 1 FALSE 329 0: 111899, -: 60554, 1: 23550, 2: 10539
non_diploma_credential_pct 0 1 FALSE 55 0%: 118818, -: 60554, 1%: 18886, 2%: 10217
still_enr_cnt 0 1 FALSE 1015 0: 64067, -: 60554, 1: 28511, 2: 16497
still_enr_pct 0 1 FALSE 100 0%: 67214, -: 60554, 1%: 19327, 2%: 15891
ged_cnt 0 1 FALSE 283 0: 116670, -: 60554, 1: 21502, 2: 9200
ged_pct 0 1 FALSE 41 0%: 121942, -: 60554, 1%: 17915, 2%: 10423
dropout_cnt 0 1 FALSE 1118 -: 60554, 0: 34184, 1: 25065, 2: 17932
dropout_pct 0 1 FALSE 92 -: 60554, 0%: 36283, 2%: 11400, 3%: 11289

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
aggregation_index 0 1.00 3.590000e+00 5.900000e-01 0 3.00000e+00 4.000e+00 4.00000e+00 4 ▁▁▁▅▇
aggregation_code 0 1.00 3.489146e+11 1.782793e+11 0 2.51101e+11 3.313e+11 4.91200e+11 680801040001 ▃▃▇▃▅
lea_beds 90094 0.60 3.512888e+11 1.673100e+11 10100010000 2.80218e+11 3.313e+11 4.41202e+11 800000090459 ▃▇▆▃▂
nrc_code 8312 0.96 3.770000e+00 1.950000e+00 1 1.00000e+00 5.000e+00 5.00000e+00 7 ▇▁▃▇▅
county_code 1132 1.00 3.496000e+01 1.771000e+01 1 2.60000e+01 3.300e+01 4.90000e+01 68 ▃▂▇▃▅
nyc_ind 1132 1.00 3.000000e-01 4.600000e-01 0 0.00000e+00 0.000e+00 1.00000e+00 1 ▇▁▁▁▃
boces_code 9300 0.96 3.434300e+03 1.699020e+03 190 2.49000e+03 3.090e+03 4.59000e+03 6690 ▂▅▇▃▃
membership_code 0 1.00 1.033000e+01 3.760000e+00 6 8.00000e+00 9.500e+00 1.10000e+01 18 ▇▇▅▁▃
membership_key 0 1.00 1.550000e+02 2.580000e+00 152 1.53000e+02 1.545e+02 1.56000e+02 160 ▇▇▅▁▃
subgroup_code 0 1.00 1.204000e+01 7.190000e+00 1 6.00000e+00 1.100e+01 1.80000e+01 25 ▇▇▇▆▆
enroll_cnt 0 1.00 2.728900e+02 3.442000e+03 1 1.00000e+01 4.900e+01 1.21000e+02 210130 ▇▁▁▁▁
gg_miss_upset(grad20)

## clean dataset for each year and select county (may take a long time to read...)
grad20<- grad20 %>%
  filter(aggregation_index == "2") %>%
  dplyr::select(-entity_inactive_date:-nrc_desc, -nyc_ind:-boces_name) %>%
  mutate(year = "2020")

grad19<- read.csv("https://uwmadison.box.com/shared/static/ceqk6p10j1jq060l23jmibs3fr9vryq4.csv") %>%
  filter(aggregation_index == "2") %>%
  dplyr::select(-entity_inactive_date:-nrc_desc, -nyc_ind:-boces_name)%>%
  mutate(year = "2019") 

grad18<- read.csv("https://uwmadison.box.com/shared/static/cfo64w5y2lo7l5tgxsrgymahnwmnr8p2.csv") %>%
  filter(AGGREGATION_INDEX == "2") %>%
  dplyr::select(-LEA_BEDS:-NRC_DESC, -NYC_IND:-BOCES_NAME) %>%
  mutate(year = "2018")
name<- tolower(colnames(grad18))
colnames(grad18)<- name

grad17<- read.csv("https://uwmadison.box.com/shared/static/qeo58dsbhzj7wz4fn0kr0ejlnygvlvj9.csv") %>%
  filter(AGGREGATION_INDEX == "2") %>%
  dplyr::select(-LEA_BEDS:-NRC_DESC, -NYC_IND:-BOCES_NAME) %>%
  mutate(year = "2017")
name<- tolower(colnames(grad17))
colnames(grad17)<- name

grad16<- read.csv("https://uwmadison.box.com/shared/static/tokjrznpqbp343elxbotdwr2guzqrtkn.csv") %>%
  filter(AGGREGATION_INDEX == "2") %>%
  dplyr::select(-ENTITY_INACTIVE_DATE:-NRC_DESC, -NYC_IND:-BOCES_NAME) %>%
  mutate(year = "2016")
name<- tolower(colnames(grad16))
name[1]<- "report_school_year"
colnames(grad16)<- name

grad15<- read.csv("https://uwmadison.box.com/shared/static/cu5yyzj3py2cy342e6erto9a15qjglu4.csv") %>%
  filter(AGGREGATION_INDEX == "2") %>%
  dplyr::select(-LEA_BEDS:-NRC_DESC, -NYC_IND:-BOCES_NAME) %>%
  mutate(year = "2015")
name<- tolower(colnames(grad15))
colnames(grad15)<- name

grad<- rbind(grad20, grad19, grad18, grad17, grad16, grad15)

# combine above datasets, convert to numeric, remove NAs and "-"
grad %>% 
  mutate_at(c("grad_cnt","local_cnt","reg_cnt","reg_adv_cnt","non_diploma_credential_cnt","still_enr_cnt","ged_cnt","dropout_cnt"),funs(gsub("-", NA, ., fixed = TRUE))) %>%
  mutate_at(c("grad_pct","local_pct","reg_pct","reg_adv_pct","non_diploma_credential_pct","still_enr_pct","ged_pct","dropout_pct"),funs(as.numeric(gsub("%", "", ., fixed = TRUE))/100)) %>%
  drop_na()
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
cols.num <- c(seq(13, 30, 1))
grad[cols.num] <- sapply(grad[cols.num],as.numeric)
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
# unified the name of categories
grad[grad == "Asian or Pacific Islander"]<- "Asian/Pacific Islander"
grad[grad == "Hispanic or Latino"]<- "Hispanic"
grad[grad == "American Indian or Alaska Native"]<- "American Indian/Alaska Native"
grad[grad == "Black or African American"]<- "Black"

# convert it to a tsibble object
grad_tsibble<- as_tsibble(grad, index = year, key = c("subgroup_code","membership_code", "county_code"))

write.csv(grad_tsibble, "grad.csv")

# average graudation rate for each county
county_grad <- grad %>%
  group_by(county_name) %>%
  summarise(pct_mean = mean(grad_pct))